Speech Separation for Hearing-Impaired Children in the Classroom

Olalere, Feyisayo, van der Heijden, Kiki, Stronks, H. Christiaan, Briaire, Jeroen, Frijns, Johan H. M., Güçlütürk, Yagmur

arXiv.org Artificial Intelligence

Fig. 1 caption: The process includes simulating room and listener acoustic properties (A), modeling talkers' movement trajectories (B), and synthesizing classroom speech mixtures (C). The numbers (1)–(5) correspond to the steps itemized in Section II-B.

The separation model is trained to output time-domain waveforms for each speaker with no interference from the other speaker or background noise. This setup enables the model not only to separate overlapping speech but also to preserve the spatial distinctions associated with each moving source. B. Simulation of Overlapping Speech for Classroom Conditions: To capture the reverberant and spatial characteristics typical of classroom environments, we developed a spatialization pipeline for generating training and evaluation data (see Fig. 1). This pipeline consists of five main components, explained below in detail: 1) simulation of room impulse responses (RIRs); 2) application of head-related impulse responses (HRIRs); 3) generation of binaural room impulse responses (BRIRs); 4) modeling of talkers' movement trajectories; 5) synthesis of the classroom speech data. 1) Room Impulse Responses: To simulate naturalistic reverberant classroom acoustics, we generated RIRs that capture the direct sound, early reflections, and late reverberation. These RIRs were used to spatialize source signals in simulated classroom environments with varying geometry, reverberation, and source-listener distances. We used the Pyroomacoustics Python package [35], which implements the image source method to model sound propagation in rectangular (shoebox) rooms. A total of 30 classrooms were simulated, with dimensions randomly sampled between 8.5 × 8.5 × 3 m and 10 × 10 × 3.5 m (length × width × height), reflecting typical U.S. classroom sizes [36], [37].
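As an illustration of how steps (1)–(3) and (5) of such a pipeline compose, the sketch below spatializes two dry talkers into a binaural mixture in NumPy. The hand-built impulse response and per-ear delay/gain are toy stand-ins for the image-source RIRs and measured HRIRs the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000  # sample rate in Hz

# (1) Toy room impulse response: direct path, a few decaying early
# reflections, and an exponentially decaying reverberant tail.
rir = np.zeros(int(0.3 * fs))
rir[0] = 1.0                                   # direct sound
for delay_ms, gain in [(12, 0.6), (23, 0.4), (41, 0.25)]:
    rir[int(delay_ms * fs / 1000)] += gain     # early reflections
tail = rng.standard_normal(rir.size) * np.exp(-np.arange(rir.size) / (0.05 * fs))
rir += 0.05 * tail                             # diffuse reverberation

# (2)+(3) A toy "HRIR" per ear, reduced to an interaural delay and level
# difference; convolving it with the RIR yields a binaural RIR (BRIR).
def brir(rir, itd_samples, ild_gain):
    hrir = np.zeros(64)
    hrir[itd_samples] = ild_gain
    return np.convolve(rir, hrir)

brir_left, brir_right = brir(rir, 0, 1.0), brir(rir, 8, 0.7)

# (5) Spatialize two dry talkers (noise as stand-in speech) on opposite
# sides of the head and sum them into one binaural mixture.
talker_a = rng.standard_normal(fs)             # 1 s of "speech"
talker_b = rng.standard_normal(fs)
left = np.convolve(talker_a, brir_left) + np.convolve(talker_b, brir_right)
right = np.convolve(talker_a, brir_right) + np.convolve(talker_b, brir_left)
mixture = np.stack([left, right])              # (2, n_samples) binaural signal
```

In the paper's actual pipeline, `rir` would come from Pyroomacoustics' image source method and the per-ear filters from a measured HRIR set; the composition by convolution and summation is the same.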



CoPatch: Zero-Shot Referring Image Segmentation by Leveraging Untapped Spatial Knowledge in CLIP

An, Na Min, Kang, Inha, Lee, Minhyun, Shim, Hyunjung

arXiv.org Artificial Intelligence

Spatial grounding is crucial for referring image segmentation (RIS), where the goal of the task is to localize an object described by language. Current foundational vision-language models (VLMs), such as CLIP, excel at aligning images and text but struggle with understanding spatial relationships. Within the language stream, most existing methods focus on the primary noun phrase when extracting local text features, undermining contextual tokens. Within the vision stream, CLIP generates similar features for images with different spatial layouts, resulting in limited sensitivity to spatial structure. To address these limitations, we propose CoPatch, a zero-shot RIS framework that leverages internal model components to enhance spatial representations in both text and image modalities. For language, CoPatch constructs hybrid text features by incorporating context tokens carrying spatial cues. For vision, it extracts patch-level image features using our novel path discovered from intermediate layers, where spatial structure is better preserved. These enhanced features are fused into a clustered image-text similarity map, CoMap, enabling precise mask selection. As a result, CoPatch significantly improves spatial grounding in zero-shot RIS across RefCOCO, RefCOCO+, RefCOCOg, and PhraseCut (+2–7 mIoU) without requiring any additional training. Our findings underscore the importance of recovering and leveraging the untapped spatial knowledge inherently embedded in VLMs, thereby paving the way for opportunities in zero-shot RIS.
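To make the idea of a patch-level image-text similarity map concrete, here is a minimal NumPy sketch. The patch and text embeddings are random stand-ins, not actual CLIP features, and the 0.7/0.3 mixing weights for the hybrid text feature are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for CLIP features: a 7x7 grid of patch embeddings
# from an intermediate vision layer, plus a noun-phrase token embedding and
# a spatial-context token embedding from the language stream.
patch_feats = rng.standard_normal((7, 7, 512))
noun_feat = rng.standard_normal(512)
context_feat = rng.standard_normal(512)

# Hybrid text feature: mix the primary noun phrase with context tokens
# that carry spatial cues (weights are illustrative, not from the paper).
text_feat = 0.7 * noun_feat + 0.3 * context_feat

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Patch-level cosine similarity between every image patch and the text.
sim_map = normalize(patch_feats) @ normalize(text_feat)   # shape (7, 7)

# Crude "clustering" of the map into object vs. background by thresholding
# at the mean similarity; the resulting mask drives mask selection.
mask = sim_map > sim_map.mean()
```

The actual method clusters the map rather than thresholding it, but the core quantity, a per-patch cosine similarity against a context-enriched text feature, is what this sketch computes.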


SonicSieve: Bringing Directional Speech Extraction to Smartphones Using Acoustic Microstructures

Yuan, Kuang, Wang, Yifeng, Zhang, Xiyuxing, Shen, Chengyi, Kumar, Swarun, Chan, Justin

arXiv.org Artificial Intelligence

Imagine placing your smartphone on a table in a noisy restaurant and clearly capturing the voices of friends seated around you, or recording a lecturer's voice with clarity in a reverberant auditorium. We introduce SonicSieve, the first intelligent directional speech extraction system for smartphones using a bio-inspired acoustic microstructure. Our passive design embeds directional cues onto incoming speech without any additional electronics. It attaches to the in-line microphone of low-cost wired earphones that plug into smartphones. We present an end-to-end neural network that processes the raw audio mixtures in real-time on mobile devices. Our results show that SonicSieve achieves a signal quality improvement of 5.0 dB when focusing on a 30° angular region. Additionally, the performance of our system based on only two microphones exceeds that of conventional 5-microphone arrays.


Spatial Speech Translation: Translating Across Space With Binaural Hearables

Chen, Tuochao, Wang, Qirui, He, Runlin, Gollakota, Shyam

arXiv.org Artificial Intelligence

Imagine being in a crowded space where people speak a different language and having hearables that transform the auditory space into your native language, while preserving the spatial cues for all speakers. We introduce spatial speech translation, a novel concept for hearables that translate speakers in the wearer's environment, while maintaining the direction and unique voice characteristics of each speaker in the binaural output. To achieve this, we tackle several technical challenges spanning blind source separation, localization, real-time expressive translation, and binaural rendering to preserve the speaker directions in the translated audio, while achieving real-time inference on the Apple M2 silicon. Our proof-of-concept evaluation with a prototype binaural headset shows that, unlike existing models, which fail in the presence of interference, we achieve a BLEU score of up to 22.01 when translating between languages, despite strong interference from other speakers in the environment. User studies further confirm the system's effectiveness in spatially rendering the translated speech in previously unseen real-world reverberant environments. This work marks a first step towards integrating spatial perception into speech translation.


Leveraging Spatial Cues from Cochlear Implant Microphones to Efficiently Enhance Speech Separation in Real-World Listening Scenes

Olalere, Feyisayo, van der Heijden, Kiki, Stronks, Christiaan H., Briaire, Jeroen, Frijns, Johan HM, van Gerven, Marcel

arXiv.org Artificial Intelligence

Speech separation approaches for single-channel, dry speech mixtures have significantly improved. However, real-world spatial and reverberant acoustic environments remain challenging, limiting the effectiveness of these approaches for assistive hearing devices like cochlear implants (CIs). To address this, we quantify the impact of real-world acoustic scenes on speech separation and explore how spatial cues can enhance separation quality efficiently. We analyze performance based on implicit spatial cues (inherent in the acoustic input and learned by the model) and explicit spatial cues (manually calculated spatial features added as auxiliary inputs). Our findings show that spatial cues (both implicit and explicit) improve separation for mixtures with spatially separated and nearby talkers. Furthermore, spatial cues enhance separation when spectral cues are ambiguous, such as when voices are similar. Explicit spatial cues are particularly beneficial when implicit spatial cues are weak. For instance, single CI microphone recordings provide weaker implicit spatial cues than bilateral CIs, but even single CIs benefit from explicit cues. These results emphasize the importance of training models on real-world data to improve generalizability in everyday listening scenarios. Additionally, our statistical analyses offer insights into how data properties influence model performance, supporting the development of efficient speech separation approaches for CIs and other assistive devices in real-world settings.
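Explicit spatial cues of the kind described in this abstract can be computed directly from a binaural pair. The sketch below estimates the interaural time difference (ITD) from a cross-correlation peak and the interaural level difference (ILD) from channel energies, using a synthetic delayed-and-attenuated signal rather than real CI microphone recordings.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000  # sample rate in Hz

# Toy binaural pair: the right channel is a delayed, attenuated copy of
# the left, mimicking a source located toward the listener's left side.
src = rng.standard_normal(fs)
true_delay = 12                       # samples (~0.75 ms ITD at 16 kHz)
left = src
right = 0.6 * np.roll(src, true_delay)

def itd_samples(left, right, max_lag=40):
    """Interaural time difference as the cross-correlation peak lag."""
    lags = np.arange(-max_lag, max_lag + 1)
    xcorr = [np.dot(left, np.roll(right, -lag)) for lag in lags]
    return int(lags[int(np.argmax(xcorr))])

def ild_db(left, right):
    """Interaural level difference in dB (left energy minus right energy)."""
    return 10 * np.log10(np.sum(left**2) / np.sum(right**2))
```

Features like these, computed per time-frequency frame in practice, are what "explicit spatial cues" refers to when they are appended as auxiliary inputs to the separation model.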


Learning Representations from Audio-Visual Spatial Alignment

Neural Information Processing Systems

We introduce a novel self-supervised pretext task for learning representations from audio-visual content. Approaches based on audio-visual correspondence (AVC) predict whether audio and video clips originate from the same or different video instances. Audio-visual temporal synchronization (AVTS) further discriminates negative pairs originating from the same video instance but at different moments in time. While these approaches learn high-quality representations for downstream tasks such as action recognition, they completely disregard the spatial cues of audio and visual signals naturally occurring in the real world. To learn from these spatial cues, we tasked a network to perform contrastive audio-visual spatial alignment of 360° video and spatial audio.
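A contrastive spatial-alignment objective of this kind can be sketched as an InfoNCE loss in which the audio and video from the same viewing direction of a 360° clip form the positive pair, and other directions of the same clip serve as negatives. The embeddings and temperature below are illustrative stand-ins, not the paper's actual encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embeddings for one 360-degree clip: features for 4 viewing
# directions from a video encoder, and audio rendered for the same 4
# directions from a spatial-audio encoder (here, roughly aligned by design).
video_emb = rng.standard_normal((4, 128))
audio_emb = video_emb + 0.1 * rng.standard_normal((4, 128))

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def spatial_alignment_loss(video_emb, audio_emb, temperature=0.1):
    """InfoNCE: each video crop must match the audio rendered for the SAME
    viewing direction; other directions of the clip act as negatives."""
    logits = normalize(video_emb) @ normalize(audio_emb).T / temperature
    # Log-softmax over audio directions; keep the diagonal (matched pairs).
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Because the negatives come from the same video at different directions, the encoder cannot solve the task from instance identity alone; it must pick up the spatial cues, which is the point of the pretext task.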


Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization

Um, Sung Jin, Kim, Dongjin, Kim, Jung Uk

arXiv.org Artificial Intelligence

The objective of the sound source localization task is to enable machines to detect the location of sound-making objects within a visual scene. While the audio modality provides spatial cues to locate the sound source, existing approaches use audio only in an auxiliary role to compare spatial regions of the visual modality. Humans, on the other hand, utilize both audio and visual modalities as spatial cues to locate sound sources. In this paper, we propose an audio-visual spatial integration network that integrates spatial cues from both modalities to mimic human behavior when detecting sound-making objects. Additionally, we introduce a recursive attention network to mimic the human behavior of iteratively focusing on objects, resulting in more accurate attention regions. To effectively encode spatial information from both modalities, we propose an audio-visual pair matching loss and a spatial region alignment loss. By utilizing the spatial cues of audio-visual modalities and recursively focusing on objects, our method can perform more robust sound source localization. Comprehensive experimental results on the Flickr SoundNet and VGG-Sound Source datasets demonstrate the superiority of our proposed method over existing approaches. Our code is available at: https://github.com/VisualAIKHU/SIRA-SSL